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We generalize the method of surrogate data of testing for 
nonlinearity in time series to the case that the data are sam- 
pled with uneven time intervals. The null hypothesis will 
be that the data have been generated by a linear stochas- 
tic process, possibly rescaled, and sampled at times chosen 
independently from the generating process. The surrogate 
data are generated with their linear properties specified by 
the Lomb periodogram. The inversion problem is solved by 
combinatorial optimization. 

The vast majority of methods of time series analysis 
deals with data measured at times which are an inte- 
ger multiple of the fixed sampling interval A. Unevenly 
sampled time series are often excluded although they are 
quite common in cases where measurements are restricted 
by practical conditions. For example, most astronomical 
observations cannot be made in the day time and must 
often be interrupted in the night due to cloudy weather; 
data from the stock exchange has gaps during the week- 
ends and holidays, etc. The reason for excluding these is 
mainly technical: most methods cannot be easily gener- 
alized to this case. This is particularly true for nonlinear 
methods of time series analysis . 

For the nonlinear approach to a given time series, we 
first have to ask for signatures of nonlinearity in the gen- 
erating process. In this paper for the first time we present 
a method to test for nonlinearity in unevenly sampled 
time series. We will extend the common concept of sur- 
rogate data where randomized data sets are used to 
obtain a Monte Carlo approximation to the probability 
distribution of a suitable test statistics. 

For this approach, we have to be able to generate se- 
quences that are random except for their linear correla- 
tions. Any additional structure can lead to spurious pos- 
itive results. The null hypothesis in this paper is that the 
data have been generated by a linear stochastic process 
that is measured instantaneously and possibly rescaled. 
This excludes correlations or even a deterministic rela- 
tionship between the sampling intervals and the observ- 
able. This method can however be generalized to that 
case. 

Time series taken at varying time intervals are very 
difficult to handle and only few successful efforts have 
been made in this area. In certain situations, one can 
interpolate the data to equally spaced sampling times. 
However, in a test for nonlinearity, one could then not 
distinguish between genuine structure and nonlinearity 
introduced spuriously by the interpolation process. 



If we want to generate surrogate data sets with the 
same linear properties as the original data, the autocorre- 
lations have to be conserved. But even such simple things 
like autocorrelations are hard to maintain for unevenly 
sampled data. Any time interval can occur between suc- 
cessive points and it is possible to combine them to nearly 
arbitrary lags. One idea is to calculate autocorrelations 
by binning all possible intervals to the desired, discrete 
lags, a process that involves some nonlinearity. Using 
these autocorrelations for generating surrogates can lead 
to the spurious rejection of purely linear time series. 

Besides the generation of surrogates, we have to be 
able to measure the degree of nonlinearity in the data. 
In contrast to the process of generating surrogates, inter- 
polations are permitted here as part of the specification 
of a test statistic. A badly designed test statistic could 
in the worst case lower the discrimination power of the 
test, while still keeping it formally correct. In this paper 
we use a very simple test statistic that measures non- 
linearity through deviations from time-reversibility. The 
main emphasis is laid on the generation algorithm for the 
surrogates. 

Standard surrogate methods make use of the Fourier 
transformation to conserve the autocorrelations of the 
original data. The method of amplitude adjusted Fourier 
transformation (AAFT, 0]) rescales the original time se- 
ries to a Gaussian distribution first. Then, the Fourier 
phases are randomized and the Fourier transformation 
is inverted. Finally, a rescaling to the original distribu- 
tion is performed. A refined method has been suggested 
in Ref. || where an iteration scheme is used to simulta- 
neously conserve the spectrum and the distribution. It 
consists of alternating Fourier transformation and rescal- 
ing steps. Both methods cannot directly be applied to 
unevenly sampled data, because they utilize the Fourier 
transformation and its inverse. 

In Ref. H , a general approach to the constrained ran- 
domization of time series is described. There, a desired 
property of the surrogate data is expressed through a cost 
function. The minimum of this cost function is reached 
if the surrogate fulfills the given property. In this paper, 
we will make use of this method with a cost function that 
can also be defined for unevenly sampled data. This will 
enable us to impose the desired linear correlation struc- 
ture on the surrogate time series. 

Let {y n } be a time series sampled at times {t n } that 
need not be equally spaced. The Power spectrum can 
then be estimated by the Lomb periodogram ||]. This 
spectral estimator is discussed e.g. in Ref. [[§. Here we 
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give the final formula: 
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where r is defined by 



tan(2a>r) 



J2 n s m 
J2 n cos 2wi„ 



(1) 



(2) 



and y, a 2 are the mean and the variance of the data, 
respectively. The result can be derived by fitting a least 
squares model y n — a cos u>t n + b sin u>t n to the data for 
each given frequency w. Therefore Lomb periodograms 
are often referred to as least squares periodograms. 

For time series sampled at constant time intervals, 
A = t n — t n -i for all n, the Lomb periodogram P 
yields the standard squared Fourier transformation. Ex- 
cept for this particular case, there is no inverse transfor- 
mation for the Lomb periodogram, which makes it im- 
possible to use the standard surrogate data algorithms 
mentioned above. Therefore, we follow the general ap- 
proach of Ref. Q, where constraints on the surrogate 
data are implemented by a cost function E({y n }) which 
has a global minimum when the constraint is fulfilled. As 
in Ref. [Q this cost function will be minimized by simu- 
lated annealing @^]. Starting with a random permuta- 
tion of the original time series, the surrogate is modified 
by exchanging two points chosen at random. The modi- 
fication will be accepted if it yields a lower value for the 
cost function or else with a probability p — exp(— AE/T). 
The "system temperature" T will be lowered slowly to 
let the system settle down to a minimum. This idea goes 
back to Metropolis et al. Q. See for example JlO| for 
details. 

In our case, we use the Lomb periodogram of the data 
as a constraint for the surrogates. It can be expressed as 
a cost function for example by: 



E = 



k=l 
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We use P at Nf equally spaced frequencies kujQ, other 
choices are possible. The parameter q specifies the dis- 
tance measure between the two periodograms. For q = 2, 
the .L 2 -distance is used. For the following examples we 
use q = 1. Higher "penalties" for large differences in 
single frequencies could be given by raising q above 1. 
Alternatively, one could use the differences between the 
square roots or logarithms of P, which puts less stress 
on the peaks in the power spectrum than (||). Another 
freedom lies in the choice of the minimum frequency lvq 



and the number of frequencies Nf and one may have to 
consider different values for each individual application. 

Recall that we are interchanging only the values of the 
time series y n , while fixing the times t n where measure- 
ments are made. This excludes data like inter-beat inter- 
vals of electrocardiograms where we have the additional 
condition t n+1 = t n + y n . 

For the calculation of surrogates, simulated annealing 
is performed until E has fallen below a given value Ef, 
the desired accuracy of the Lomb periodogram. In Monte 
Carlo simulations, new configurations are usually gener- 
ated from "older" configurations. In order to avoid cor- 
relations between the different surrogates, we prefer to 
start with a completely random permutation of the orig- 
inal time series for each surrogate. The starting temper- 
ature can roughly be determined by calculating the cost 
function for some randomly shuffled data sets and choos- 
ing To as a "typical" difference in cost between them. 
Once an adequate starting temperature is known, it can 
be used for further surrogates. 

Two improvements that accelerate the annealing al- 
gorithm have been made. The first is to choose the 
two points that are candidates for an exchange with 
a probability that depends on their difference in mag- 
nitude. Exchanging two points with a big difference 
in their rank (eg. the smallest and the largest value) 
yields a larger change of the cost function than ex- 
changing two points that do not differ much in their 
ranks. For good performance, it is desired to keep ap- 
proximately the same acceptance rate through all tem- 
peratures. This can be achieved by choosing pairs of 
points {yi, yj} with i ^ j and probability Pij{d, fi) where 
d = |rank(j/j) — rank(j7j)| — 1. The probability p is cho- 
sen to have a maximum for d = and to decrease for 
higher d. The parameter /i characterizes the "width" of 
the distribution p and should be varied proportional to 
N/T. The exact shape of p does not seem to be of much 
importance. For example, we were not able to observe 
a significant difference in performance between exponen- 
tial and Gaussian distributions. But in all cases we con- 
sidered, a non-uniform pij substantially accelerated the 
annealing process, so that higher accuracy is reachable 
with the same computational effort. 

Calculating E is very time consuming for long time 
series and many frequencies. With a typical value of 
Nf oc N we have an algorithm of order N 2 for each an- 
nealing step. Additionally, the number of annealing steps 
grows at least linearly with N. For our applications, it 
is not necessary to recalculate all sums in (Q) for every 
exchange, because we only change the values y n while 
fixing the times t n and frequencies kujo. In (Q), r and 
the two denominators do not depend on y n and can be 
stored in arrays for every frequency kujQ at the beginning 
of the annealing process. The sums in the numerator do 
not change much either and only two terms have to be 
recalculated. This reduces the recalculation of the Lomb 
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FIG. 1. Delay-plot of an unevenly sampled Henon map 
and one surrogate. The test finds a significant difference in 
time-asymmetry. 
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FIG. 2. Lomb periodogram of data set E. See text for 
details. 



periodogram to order N. 

But even with the described modifications to the al- 
gorithm the actual annealing is quite computer time in- 
tensive. Generating one surrogate for a time series with 
TV = 2000 points while fixing Nf — 1000 frequencies 
takes about 5 hours CPU-time on a DEC alpha worksta- 
tion at 166 MHz. 

So far, we have described how to produce randomized 
versions of unevenly sampled time series with given linear 
correlations, which is the main point of this paper. Let 
us now demonstrate, how such surrogate sequences can 
be used in tests for nonlinearity. For this purpose, we 
have to be able to measure the degree of nonlinearity in 
a time series. Many statistics that have proven useful for 
evenly sampled time series (see e.g. 11 ) cannot easily be 



generalized to unevenly spaced data. This generalization, 
in general, is a topic of future research. 

Here, as a first simple statistic, we choose a measure 
for time-reversibility, which is a good indicator for non- 
linearity. It is however not very enlightening about what 
source of nonlinearity there might be. For the data sorted 
in time order, 
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is calculated, which is just the mean of the slopes, taken 
to the third power. For a time series generated by a 
linear process, and for the surrogates, we expect 7 ss 
0. In contrast, time series with nonlinearities can be 
asymmetrical in time and may yield values of 7 7^ 0. To 
pay regard to deviations in both directions (7 > and 
7 < 0), a two sided test jl2| has to be performed. 

To test the functionality of the surrogate test, we use 
10000 points of the Henon map as a first example. From 
these, we pick TV = 1000 points with their time indices 
chosen randomly. To generate surrogates, we calculate 
the Lomb periodogram for Nf — 500 frequencies in the 
interval v 6 [0,0.5]. A delay-plot of the data and one 
surrogate is shown in Fig. [l]. A little trace of the Henon 
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FIG. 3. The down-sampled data set E with one corre- 
sponding surrogate. Gaps of different sizes prevents reason- 
able interpolation. 



attractor can be found in the left figure that is built by 
pairs of values with time delay one. For the original time 
series we get 7 = —0.58, while 19 surrogates gave values 
-0.30 < 7 < 0.18. This corresponds to a 90% level of 
significance for the time series not being time reversible 
and hence nonlinear in the sense of the null hypothesis. 

In order to verify the new test method we also applied 
the test to linear time series based on an AR(l)-process 
x n+ \ = 0.95x n + rj n which is however invertibly rescaled 
by y n = Xn\fxn- Again we picked 1000 points randomly 
from the original 10000. A test performed with this data 
was unable to reject the null hypothesis, as expected since 
the null was true. 

For a quantification of the power and the size |lj] of 
the new method, many independent tests would be neces- 
sary. Such an extremely computer time-intensive study 
is beyond the scope of the present work. 

Finally, the test is applied to experimental data. Data 
Set E of the Santa Fe time series contest is a set of 
measurements of the time-integrated intensity of light 
observed from a variable star. It consists of 17 parts 
with different numbers of points, the time range of which 
partly overlaps and partly shows gaps. Inside the blocks, 
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the data is evenly sampled with A = 10 s. Special in- 
terest in low frequency components makes it desirable 
to consider the time series as a whole. The Lomb peri- 
odogram of the data set is shown in Fig. 0. 

For the surrogate test we further down-sampled the 
data by integrating over 12 successive measurements. 
Therefore, surrogates could be generated in reasonable 
time. The resulting time series is N — 2260 points long 
with A = 120 s except for 9 gaps taking up to 10000 s, as 
shown in Fig. |j. The Lomb periodogram is calculated at 
Nf — 1130 frequencies with up to v max — 1/240 Hz. The 
value for the time-reversibility statistic 7 = —0.56 x 10 -7 
s -3 of the considered time series does not lie outside the 
interval 7 e [—13 x 10~ 7 s~ 3 , 29 x 10~ 7 s -3 ] spanned sur- 
rogates, and thus the null hypothesis cannot be rejected. 
One surrogate time series is also shown in Fig. 0. 

Spurious high frequency components could be intro- 
duced by discrepancies in the overlapping parts of the 
recording. To deal with that problem we deleted points 
from one of the two parts and repeated the test. No sig- 
nificant differences in the surrogates and its values for 
the test statistics were observed. 

Taking a closer look at the individual parts shows 
considerable differences in their autocorrelations, which 
makes it dangerous to consider the whole data set as sta- 
tionary. In contrast, the generated surrogates are station- 
ary by construction. If one could detect significant differ- 
ences with a nonlinear statistic, non-stationarity would 
be an equally likely explanation as nonlinearity. 

In this paper we presented a method for a test for non- 
linearity for time series with uneven time intervals. Such 
a test consists of two main steps: generating surrogate 
data and calculating test statistics. The new method is 
able to achieve the first step using the constrained ran- 
domization scheme proposed in Ref . |ij . We offered only 
a first attempt on the second problem. More powerful 
test statistics are likely to be derivable from current non- 
linear time series methods. 
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